ABSTRACT

While much has been written on the topic of
criterion-referenced testing and consequently its comparison with
norm-referenced testing, measurement specialists have not as readily
approached the subject of the implications involved in reporting such
test information. The purposes of this paper are to (1) draw
distinctions between criterion and norm-referenced assessment; (2)
delineate the purposes for which test information is
employed; and (3) evaluate the usefulness of criterion and
norm-referenced measurement in providing the necessary data for each
test information use. The six major uses of test information included
in the study are: prognosis, diagnosis of learning difficulty,
student growth, student achievement, program evaluation, and research.
The six uses of test information outlined above constitute the basic
requirement needs of measurement specialists. In an earlier work,
Cronbach (1949) classified testing under three main headings:
prognosis, diagnosis, and research; three additional uses of test
information have been added: growth, achievement, and program
evaluation. Such an evaluation should hopefully provide school
personnel with the understanding of the distinction between the types
of information available from both criterion and norm-referenced
testing. Such knowledge should help to promote the understanding that
certain measurement information is better assessed by one type of
assessment tool rather than the other depending upon the
decision-making purpose in question. (Author/DEP)

Measurement specialists have been exploring the advantages and
disadvantages of criterion-referenced testing since its introduction to
the educational community. As a consequence of its popularity, there
have also been numerous comparisons drawn on many dimensions with
norm-referenced testing. Some experts, for example, Block (1971),
Glaser & Nitko (1970), and Popham & Husek (1969), contend that it is not
possible to distinguish between a criterion-referenced and a
norm-referenced test by just looking at it. Others, Ebel (1971), Davis
(1970), and Simon (1969), assert that the basic difference between the
two types of instruments lies in the interpretations given to the scores
of tests.

The measurement specialists do not appear to have approached the
critical analysis of the implications involved in reporting such
criterion-referenced test information. The position taken in this paper is that
there are basic inherent differences in the intent and construction of
both types of tests. Given these basic differences, it is further
believed that the issue of criterion-referenced versus norm-referenced
testing lies with the uses for which testing information will be employed;
that is, rather than viewing and comparing such tests in an "either/or"
context, one should base the evaluation upon the decision-making purpose
for which testing information is needed.

Keeping the above in mind, the purposes of this paper are threefold:
1) To draw distinctions between criterion and norm-referenced
assessment; 2) To delineate the purposes for which test
information is employed; and 3) To evaluate the usefulness of criterion
and norm-referenced measurement in providing the necessary data for each
test information use. The six major uses of test information included
in this study are: prognosis, diagnosis, research, program evaluation,
achievement, and growth.

Characteristics of Criterion and Norm-Referenced Measurements

Criterion-referenced assessment contains certain basic characteristics
commonly ascribed to such instruments: statements of specific instructional
objectives are listed for particular content areas; specific mastery levels
are predetermined for each objective or test; learners' responses are
measured against these predetermined criterion levels. In addition, a
designation is made as to which objectives have been mastered by each
individual. Some optional characteristics found in many
criterion-referenced assessment strategies include refinement of outcome status,
summary data on the group measured, and possibly the grouping of individuals
for remedial instruction.

Refinement of outcome status offers more detailed information on
objectives than the dichotomy "Pass/Fail". Such categories as
"Exceeded Mastery Level", "Achieved Mastery Level", and "Below Mastery
Level", with entries establishing by how much scores deviate from
the mastery level, provide more information, especially for advanced
instruction or remediation purposes.
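
The refinement described above can be illustrated with a minimal sketch.
The function name, the data, and the rule that a score exactly at the
mastery level counts as "Achieved" are assumptions made for illustration;
the paper does not prescribe a scoring rule.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    status: str      # one of the three category labels quoted in the text
    deviation: int   # items correct minus the mastery level


def outcome_status(items_correct: int, mastery_level: int) -> Outcome:
    """Classify one learner's result on one objective.

    Assumption: meeting the mastery level exactly counts as "Achieved";
    any score above it counts as "Exceeded".
    """
    deviation = items_correct - mastery_level
    if deviation > 0:
        status = "Exceeded Mastery Level"
    elif deviation == 0:
        status = "Achieved Mastery Level"
    else:
        status = "Below Mastery Level"
    return Outcome(status, deviation)


# Hypothetical objective with five items and a mastery level of 4 correct.
for correct in (5, 4, 2):
    print(correct, outcome_status(correct, mastery_level=4))
```

The deviation is reported alongside the category so that a teacher can see
how far a learner is from mastery, which is the extra information the
refined categories are meant to supply.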

Summary data for group administrations provide useful information on
particular objectives. For example, if it is known that only a small
percentage of the learners in a classroom mastered a particular objective, one
would assume that either instruction on this objective was not effective
or that the test items did not adequately measure the objective. Such
summaries are gathered across individuals on particular objectives;
summative information for individual examinees across all objectives
on a test should not be reported. To provide a total score on diverse
skills would not supply the relevant information of whether individuals
have mastered specific objectives.

Summary information may also be used for the purposes of grouping
for remedial instruction. If it is discovered that for a particular
group of objectives certain individuals consistently do not achieve
mastery, then this information may be used to set up small informal
groups to provide enrichment opportunities. Thus, it is hoped that
upon second testing, mastery of the objectives will be achieved.
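
A short sketch may make the two group-level uses concrete: summarizing
mastery across individuals for each objective, and listing the learners who
might be grouped for further instruction. The learner names, objective
labels, and function names are hypothetical, and the sketch assumes every
learner has a mastery designation for every objective.

```python
from collections import defaultdict

# Hypothetical mastery designations: learner -> {objective: mastered?}
results = {
    "Ann":  {"obj1": True,  "obj2": False, "obj3": True},
    "Ben":  {"obj1": True,  "obj2": False, "obj3": False},
    "Cara": {"obj1": True,  "obj2": True,  "obj3": False},
    "Dev":  {"obj1": False, "obj2": False, "obj3": True},
}


def mastery_rates(results):
    """Proportion of learners who mastered each objective."""
    counts = defaultdict(int)
    for designations in results.values():
        for objective, mastered in designations.items():
            counts[objective] += int(mastered)
    n = len(results)
    return {objective: counts[objective] / n for objective in sorted(counts)}


def remedial_groups(results):
    """Learners who did not master each objective, listed per objective."""
    groups = defaultdict(list)
    for learner, designations in results.items():
        for objective, mastered in designations.items():
            if not mastered:
                groups[objective].append(learner)
    return dict(groups)


rates = mastery_rates(results)
groups = remedial_groups(results)
print(rates)
print(groups)
```

Note that the sketch summarizes across learners within an objective and
deliberately computes no total score per learner, in keeping with the point
that a total over diverse skills does not show which objectives were
mastered.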

In norm-referenced assessment explicit instructional objectives
are not specified. Therefore, it is not generally possible to determine
whether or not a learner has mastered particular skills. Although
standardized norm-referenced tests usually report subtest scores,
these subtest scores cover broad numbers of objectives under one
subheading such as "Mathematics Concepts." One cannot tell which specific
objectives constitute a "Mathematics Concepts" subtest without the test
blueprints.

Another characteristic of norm-referenced assessment is the use of
methods of item analysis to select items which discriminate best among
examinees. Items with very high or very low item difficulty are
usually removed from such tests, especially those standardized on
national samples of students. In a criterion-referenced testing context,
the issue of discrimination is between pre and posttest results, i.e.,
barring prior knowledge, examinees should not be able to answer test
questions before instruction but upon completion of instruction (if it
has been effective) they should achieve mastery of all objectives.
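
The contrast between the two notions of discrimination can be sketched in
code. The paper names no particular statistics, so the indices below are
assumptions chosen for illustration: item difficulty as the proportion
answering correctly, an upper-lower discrimination index based on the top
and bottom 27% of total scores (a common convention), and a simple
posttest-minus-pretest difference. All data are hypothetical.

```python
def item_difficulty(responses):
    """Proportion of examinees answering the item correctly (1 = correct)."""
    return sum(responses) / len(responses)


def discrimination_index(item_responses, total_scores, fraction=0.27):
    """Upper-lower index: item difficulty in the top-scoring group minus
    item difficulty in the bottom-scoring group.

    Assumption: groups are the top and bottom `fraction` of examinees by
    total score; 27% is a common convention, not taken from the paper.
    """
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    k = max(1, round(len(order) * fraction))
    lower = [item_responses[i] for i in order[:k]]
    upper = [item_responses[i] for i in order[-k:]]
    return item_difficulty(upper) - item_difficulty(lower)


def pre_post_gain(pre_responses, post_responses):
    """Posttest difficulty minus pretest difficulty for one item."""
    return item_difficulty(post_responses) - item_difficulty(pre_responses)


# Hypothetical data for one item and ten examinees.
item = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
totals = [48, 45, 44, 40, 38, 35, 30, 28, 25, 20]
pre = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
post = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]

print(item_difficulty(item))
print(discrimination_index(item, totals))
print(pre_post_gain(pre, post))
```

Under the norm-referenced view an item earns its place by separating high
scorers from low scorers; under the criterion-referenced view described
above, an item is working when few examinees answer it before instruction
and most answer it afterward.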

A third primary characteristic of norm-referenced assessment is
the comparison of an individual's score to that of others. In standardized
norm-referenced tests raw scores are converted to some form of interpreted
score for comparison purposes; in teacher-made tests the comparisons may
be as simple as comparing Student A's total raw score with that of Student
B. In any event, the emphasis is usually upon comparing total test scores
and consequently assessing which examinee possesses more knowledge as
identified by correct responses to test questions.
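
As an illustration of converting raw scores to an interpreted score, the
sketch below computes two familiar derived scores, standard (z) scores and
percentile ranks. The paper does not say which conversions it has in mind,
so these two are assumptions, as are the function names and the data. The
z-score function also assumes the group's scores are not all identical.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Standard scores relative to the group's own mean and spread."""
    m, s = mean(raw_scores), pstdev(raw_scores)
    return [(x - m) / s for x in raw_scores]


def percentile_rank(score, raw_scores):
    """Percent of the group scoring below `score`, counting half of ties."""
    below = sum(x < score for x in raw_scores)
    ties = sum(x == score for x in raw_scores)
    return 100 * (below + 0.5 * ties) / len(raw_scores)


# Hypothetical raw scores for a norm group of eight examinees.
norm_group = [12, 15, 15, 18, 20, 22, 25, 33]
print([round(z, 2) for z in z_scores(norm_group)])
print(percentile_rank(20, norm_group))
```

Both conversions express a score only in relation to the other scores in
the group; neither says anything about which objectives the examinee has
mastered.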

Purposes for Which Test Information Is Employed

In an earlier work, Cronbach (1949) classified testing under three
main headings: prognosis, diagnosis, and research; three additional uses
of test information have been added: growth, achievement, and program
evaluation. While it may be argued by some that the three additional
areas could justifiably be included within the framework of Cronbach's
original three, discussion will hopefully show that they provide
significant differences in intent to justify their individual categorization.

Prognostic testing is defined as the use of assessment techniques to
predict the success of an individual in some future undertaking. An
instrument is administered which has been found to successfully distinguish
those individuals who show the presence of an ability, trait, etc.,
which has been known to correlate highly with success on some future task.
A widely known instrument of this kind is the Scholastic Aptitude Test
taken by many juniors and seniors wishing to gain admission to college.
The basis upon which interpretations of results are formulated rests
with the admission policies set by most schools; selection of candidates
is usually based upon certain criteria, for example, a score of 600
or more on both the quantitative and verbal subtests of the SAT.
Selection of those candidates who show the most potential (based upon
their total test scores) is the decision-making purpose in this instance.

The diagnosis of learning difficulty constitutes a second major
purpose for which information is required. Information is needed to
assess the level of expertise of an individual on particular skills
considered important. Testing inquires into the particular patterns of
correct and incorrect responses to questions, thereby providing
information upon which remedial instruction can be based. The purpose of
testing and the uses to which the test information will be put are
based upon what a particular individual can and cannot do. A pre-reading
test of psycho-motor ability would be an example of a diagnostic
assessment tool. The teacher is interested in assessing what particular
psycho-motor skills a learner has acquired and on which skills more time must
be spent.

Research and program evaluation are chosen to be discussed jointly
because of their apparent similarities. Testing for research purposes
involves using instruments to gather quantitative data concerning
achievement, aptitude, etc., for the purpose of hypothesis testing. Testing
involves drawing a representative sample of subjects who are administered
tests relevant to the hypotheses in question. The researcher is usually
interested in differential performance of groups or individuals on the
test. A hypothetical example of testing for research purposes would be
the case of an experimenter who hypothesizes that understanding mechanical
relationships is correlated with tough-mindedness, practicality, and careful
planning. To test this hypothesis, the experimenter would draw a random
sample of subjects and administer tests which appear to validly and
reliably assess the variables in question, for example, such tests as
the Bennett Mechanical Comprehension Test and the Thorndike Dimensions
of Temperament test. The results of the analysis would be used to
test the hypothesis of the relationship of mechanical comprehension
and pragmatism to a particular significance level.

The use of test information for evaluation purposes is seen as a
special case of testing for research. Program evaluation usually depends
upon the classroom unit as the basis for sampling. Random sampling is
exceedingly rare due to the cost, in both time and money, of such operations.
An additional difference between testing for research and program evaluation
is the purpose for which each is intended. While testing in a research
context will have as its final outcome the acceptance or rejection of
particular hypotheses, program evaluation results will be used for
decision-making purposes concerning the program under study: whether it will be
continued or suspended, whether or not increased funds will be appropriated
for its further development, etc. Generally, such decision-making functions
are more applied than the testing of most research hypotheses.



Program evaluation also depends upon a unit (or units) of instruction
being administered and tests given to assess whether the instruction
has been effective. A very common approach is to administer
alternate forms as pre and posttests to ensure that achievement is a
product of the instruction. Attitudinal instruments may also be
administered to check upon the effect that the program has on the thoughts,
feelings, behaviors, etc. of the participants. For program evaluation
to be most effective and provide the most meaningful data, the instruments
used must be geared directly toward the instruction; i.e., the test items
used must relate to the specific curriculum objectives.

Testing to assess achievement is probably the most well known
purpose for which tests are utilized. An area of interest is delineated, a
sample of tasks is drawn, and assessment is made at a point in time of
how well an individual does on those tasks. Brown (1970) lists three
properties that must be present in order to classify a test as an
achievement instrument:

1. The skill and content domains covered by the test can be
specified and defined in behavioral terms;

2. The test does, in fact, measure these important behaviors
rather than irrelevant considerations; and

3. The test takers have had equal exposure, or equal
opportunities for exposure, to the material being tested (pp. 253-254).

A simplified example of an achievement measure would be a teacher-made
test to assess the performance of learners on a particular unit (or units)
of instruction. The teacher delineates objectives important in the
instruction and attempts to construct items which directly measure those
objectives of instruction.

The last purpose of testing to be explored here is the assessment
of growth. Angoff (1971) defines growth as "... an increment in score
associated with the passage of time ..." While many methodological
considerations must be taken into account by those proposing such studies
(for example: parallel forms of a test, test score equating, distinction
between growth and practice effect, reliability adequacy, etc.), such
procedures, if carefully carried out, offer a powerful tool in assessing
the status of an individual or group over a period of time.

Examples of measurement to assess growth must necessarily deal
with the methodological considerations noted above and would apply to
any test used for that purpose. A test administrator must take care
that the tests given at the beginning and end of the period of time in
question measure the same function. (It is assumed that some
intervening occurrence has taken place in the period of time between testings.)
The two tests must have the same units expressed (test scores must be
equated). In addition, one must pay particular attention to practice
effects. A partial solution would be to provide alternative forms for
first and second testing, ensuring that a reasonable length of time has
elapsed between testings.

Information Needed as the Basis for Decision-Making

The discussion of the six test information uses points out the fact
that different testing situations require different kinds of information.
Generally we may dichotomize test information needs into
"objective-specific" and "total-comparative". Objective-specific test information
appears to be assessed much better by utilizing rules for constructing
criterion-referenced tests, while total-comparative information is the
domain of norm-referenced testing. Although it could be possible to
supply criterion-referenced test information for a norm-referenced
use (and vice versa), it appears that they both provide different
kinds of information more relevant to some data needs than others.

More specifically, for prognostic testing, information is
needed to differentiate or sort individuals into two classifications:
acceptees and rejectees. A cut-off point may be established for each
particular situation; the intent is to select those individuals whose
total test score meets or exceeds a certain score regardless of the
individual items or patterns of items correct. A norm-referenced
approach seems most suited for this purpose.
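
The selection rule just described can be written out in a few lines. The
candidates, their item responses, the cut-off of 4, and the function name
are all hypothetical; the sketch only illustrates that the decision depends
on the total score and not on which items were answered correctly.

```python
# Hypothetical item-level responses (1 = correct) for four candidates.
responses = {
    "P": [1, 1, 1, 1, 0, 0],
    "Q": [0, 0, 1, 1, 1, 1],
    "R": [1, 0, 1, 0, 0, 0],
    "S": [1, 1, 1, 1, 1, 1],
}


def select(responses, cutoff):
    """Sort candidates into acceptees and rejectees by total score alone."""
    totals = {name: sum(items) for name, items in responses.items()}
    acceptees = sorted(n for n, t in totals.items() if t >= cutoff)
    rejectees = sorted(n for n, t in totals.items() if t < cutoff)
    return acceptees, rejectees


acceptees, rejectees = select(responses, cutoff=4)
print(acceptees, rejectees)
```

Candidates P and Q answer different items correctly but have the same total
and so receive the same decision, which is why item-level patterns carry no
weight in this use of test information.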

Testing for diagnostic purposes has as its main goal the discovery
of patterns of correct and incorrect responses on items related to
specific skills. An examination of a learner's test results should
show in which areas the learner excels or is deficient. In order to
answer such questions, the test items must be directly related to the
skills considered important. A criterion-referenced assessment approach
with test items directly relating to specific skills or objectives would
appear to be the best choice in this particular situation.

In general, data gathering for the purpose of hypothesis testing
relies upon the assumption that different groups or individuals possess
traits, abilities, attitudes, etc. in varying amounts which, when
related with other variables, will produce significant differences for
hypotheses posed. The researcher's interest is in distinguishing
among groups or individuals on their test scores, thus helping to
clarify their results for hypothesis testing. For such purposes of
separating or categorizing individuals on the basis of test scores,
norm or comparison-referenced testing appears to be most suited.
Little interest is paid to the individual responses individuals
made; the emphasis is on total test score results.

To assess program effectiveness, an evaluator is most interested
in determining whether the program under study was effective in
reaching its program objectives. To accurately assess such a situation,
the instruments used in the evaluation must be directly related to
the goals of the program. Criterion-referenced testing with its
commitment to items directly related to specific objectives seems
well suited for this purpose.

Likewise with achievement testing, where "... skill and content
domains covered by the test (should) be specified and defined in
behavioral terms" and "the test (should) measure important behaviors
rather than irrelevant considerations", criterion-referenced testing
appears better suited for assessing the true state of affairs. It should
be noted, though, that in testing to assess achievement, test administrators
are often interested in drawing comparisons among classroom groups or
individuals for purposes of decision making. In such instances, in addition
to the primary criterion-referenced interpretations, the data may also be
used to extract comparisons.

Assessing growth in an individual is closely aligned to measuring
achievement. The primary difference is in the time frame used for both
procedures. While achievement testing is sampling behavior at a certain
point of time, growth measurement attempts to sample achievement over a
period of time. Given the requirements listed above that must be taken
into consideration when growth studies are proposed, the tests used to
assess growth should have the characteristics listed for achievement
testing; i.e., they should primarily be criterion-referenced. As was
true with testing for achievement purposes, comparisons may be requested
between classrooms or individuals, but the primary purposes of such
measurement should not be forsaken.



In summary, the work reported herein should hopefully provide
school personnel with the understanding of the distinctions between the
types of information available from both criterion and norm-referenced
testing. Such knowledge should help to promote the understanding that
certain measurement information is better assessed by one type of
assessment tool rather than the other depending upon the
decision-making purpose in question.